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Abstract 

This paper is concerned with the rehable inference of optimal tree- 
approximations to the dependency structure of an unknown distribution gen- 
erating data. The traditional approach to the problem measures the depen- 
dency strength between random variables by the index called mutual infor- 
mation. In this paper reliability is achieved by Walley's imprecise Dirichlet 
model, which generalizes Bayesian learning with Dirichlet priors. Adopting the 
imprecise Dirichlet model results in posterior interval expectation for mutual 
information, and in a set of plausible trees consistent with the data. Reliable 
inference about the actual tree is achieved by focusing on the substructure 
common to all the plausible trees. We develop an exact algorithm that infers 
the substructure in time O(m^), m being the number of random variables. 
The new algorithm is applied to a set of data sampled from a known distribu- 
tion. The method is shown to reliably infer edges of the actual tree even when 
the data are very scarce, unlike the traditional approach. Finally, we provide 
lower and upper credibility limits for mutual information under the imprecise 
Dirichlet model. These enable the previous developments to be extended to 
a full inferential method for trees. 
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1 Introduction 

This paper deals with the following problem. We are given a random sample of n 
observations, which are jointly categorized according to a set of m nominal random 
variables i, j, k, etc. The dependency between two variables is measured by the 
information-theoretic symmetric index called mutual information [Kul68]. If the 
chances^ tt of all instances defined by the co-occurrence of i = i^ J = j, K = k, etc., 
were known, it would be possible to approximate the distribution by another, for 
which all the dependencies are bivariate and can graphically be represented as an 
undirected tree T, that is the optimal approximating tree-dependency distribution 
(Section 2). This result is due to Chow and Liu [CL68], who use KuUback-Leiber's 
divergence [KL51] to measure the similarity of two distributions. 

Since only a sample is available, the joint distribution tt is unknown and an 
inferential approach is necessary. Prior uncertainty about the vector tt is described 



-^We denote vectors by x:={xi,...,Xd) for x£{n,t,u,-K,...}. 



Robust Inference of Trees 



3 



by the imprecise Dirichlet model (IDM) [Wal96]. This is an inferential model that 
generalizes Bayesian learning with Dirichlet priors, by using a set of prior densities 
to model prior (near-)ignorance. Using the IDM results in posterior uncertainty 
about TT, the mutual information and the tree T (Section 2). In general, this makes 
a set of trees T consistent with the data. 

Robust inference about T is achieved by identifying the edges common to all the 
trees in T, called strong edges (Sections 3). An exact and an approximate algorithm 
are developed that detect strong edges in times 0{m*) and O(m^), respectively. 
The former is applied to a set of data sampled from a known distribution, and 
is compared with the original algorithm from Chow and Liu (Section 5). The new 
algorithm is shown to reliably infer partial trees (we call them forests), which quickly 
converge to the actual complete tree as the sample grows. Unlike the traditional 
approach based on precise probabilities, the new algorithm avoids drawing wrong 
edges by suspending the judgement on those for which the information is poor. 

Many technical issues are addressed in the paper to develop the new algorithm. 
The identification of strong edges involves solving a problem on graphs. We develop 
original exact and approximate algorithms for this task in Section 3. Robust in- 
ference involves computing bounds for the lower and upper expectation of mutual 
information under the IDM (Section 4). We provide conservative (i.e., over-cautious) 
bounds that at most make an error of magnitude 0(n~^). 

These results lead to important extensions, reported in Section 6. Inference 
on mutual information is extended by providing lower and upper credibility limits 
under the IDM (i.e., intervals that depend on a given guarantee level). The overall 
approach extends accordingly. Furthermore, we discuss alternatives to the strong 
edges algorithm proposed in this paper, aiming to exploit the results presented here 
in wider contexts. 

To our knowledge, the literature only reports two other attempts to infer robust 
structures of dependence. Kleiter [Kle99] uses approximate confidence intervals on 
mutual information^ to measure the dependence between random variables. Kleiter's 
work is different in spirit from ours. We look for tree structures that are optimal in 
some sense, by using systematic and reliable interval approximations to the actual 
mutual information. Kleiter focuses on general graphical structures and is not con- 
cerned with questions of optimality. Bernard [BerOl] describes a method to build 
a directed graph from a multivariate binary database. The method is based on the 
IDM and Bayesian implicative analysis. The connection with our work is looser 
here since the arcs of the graph are interpreted as logical implications rather than 
probabilistic dependencies. 



^Notc that accurate expressions for credible mutual information intervals have been derived in 
[HutOl, HZ05]. 
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2 Background 

2.1 Maximum spanning trees 

This paper is concerned witii trees. In tlie undirected case, trees are undirected 
connected graphs with m nodes and m — 1 edges. Undirected trees are such that for 
each pair of nodes there is only one path that connects them [PS82, Proposition 2]. 
Directed trees can be constructed from undirected ones, orienting the arrows in such 
a way that each node has at most a single direct predecessor (or parent). When used 
to represent dependency structures, the nodes of a tree are regarded as random 
variables and the tree itself represents the dependencies between the variables. It 
is a well-known result that all the directed trees that share the same undirected 
structure represent the same set of dependencies [VP90]. This is the reason why the 
inference of directed trees from data focuses on recovering the undirected structure; 
and it is also the reason why this paper is almost entirely concerned with undirected 
trees (called more simply 'trees' in the following). 

Chow and Liu [CL68] address the problem of approximating the actual pattern 
of dependencies of a distribution by an undirected tree. Their work is based on 
mutual information. Given two random variables t, j with values in {l,...,di} and 
{l,...,dj}, respectively, the mutual information is defined as 



chances. Chow and Liu's algorithm works by computing the mutual information 

for all the pairs of random variables. These values are used as edge weights in a 
fully connected graph. The output of the algorithm is a tree for which the sum of 
the edge weights is maximum. In the literature of graph algorithms, the general 
version of the last problem is called the maximum spanning tree [PS82, p. 271]. Its 
construction takes 0{m'^) time. This is also the computational complexity of the 
above procedure. The tree constructed as above is shown to be an optimal tree- 
approximation to the actual dependencies when the similarity of two distributions 
is measured by KuUback-Leiber's divergence [KL51]. 

Chow and Liu extend their procedure to the inference of trees from data by 
replacing the mutual information with the sample mutual information (or empirical 
mutual information). This approximates the actual mutual information by using, 
in the expression for mutual information, the sample relative frequencies instead of 
the chances TTij, which are typically unknown in practice. 

2.2 Robust inference 




where Tr^ is the actual chance of {i,j) , and iTi^ S.-TTjj and n. 



^^TTjj are marginal 



Using empirical approximations for unknown quantities, as described in the previous 
section, can lead to fragile models. Fragile models produce quite different outputs 
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depending on the random fluctuations involved in the generation of the sample. 

Reliability can be achieved by robust inferential tools. In this paper we consider 
the imprecise Dirichlet model [Wal96, Ber05] . The IDM is a model of inference for 
multivariate categorical data. It models prior uncertainty using a set of Dirichlet 
prior densities and does posterior inference by combining them with the likelihood 
function (see Section 4.1 for details). The IDM rests on very weak prior assumptions 
and is therefore a very robust inferential tool. 

The IDM leads to lower and upper expectations for mutual information (and, 
possibly, lower and upper credibility limits), i.e., to intervals. This is a complication 
for the discovery of tree structures from data. In fact, the maximum spanning tree 
problem assumes that the edge weights can be totally ordered. Now, multiple values 
of mutual information are generally consistent with the given intervals. In general, 
this prevents us from having a total order on the edges: not all the pairs of edges 
can be compared. 

The generalization of Chow and Liu's approach is achieved via the definition of 
more general graphs that can deal with multiple edge weights. This is done in the 
next section. 

3 Set-based weighted graphs 

Consider an undirected fully connected graph Gw=< V,E>, with m = \V\ nodes, 
and where E denotes the set of edges [{v,v)^E for each veV]. Gyj is also a weighted 
graph, in the sense that each edge e&E is associated with the real number w{e), 
which in this paper will be a value of mutual information. Consider a set of graphs 
with the same topological structure but different weight functions w; in a non-empty 
set W: Q = {Gu;-w eW}. We call Q a set-based weighted graph. Note that Q can be 
thought of also as a single graph G, on each edge e of which there is a set of real 
weights: {w{e):weW}. Yet, for the latter view to be equivalent to the former, one 
should pay attention to the fact that there could be logical dependencies between 
weights of two different sets; in other words, it could be the case that not all the 
pairs of weights in the cartesian product of two sets appear in a single graph of Q. 

In order to extend the notion of maximum spanning tree to set-based weighted 
graphs, we define the solution of the maximum spanning tree problem generalized 
to set-based weighted graphs, as the set T of maximum spanning trees originated 
by the graphs in Q. 

Recall that Kruskal's algorithm only needs a total order on the edges to build 
a unique maximum spanning tree [KJ56]. Therefore, in order to focus on T, we 
can equivalently focus on the set Or of total orders that are consistent with the 
graphs in ^. In the following we find it more convenient not to directly deal with 
Or- Rather, we first show how to construct a partial order that is consistent with 
all the total orders in Or, and then we consider all the total orders that extend the 
partial order. Initially, we need the following definition. 
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Definition 1. We say that edge e dominates edge e' ifw{e) >w{e') for all w&W . 

By applying the above definition to all the distinct pairs of edges in G we obtain 
the sought partial order. To see that the order is only partial in general, consider 
the example graph in Figure 1. We have defined such a graph G by drawing the 
graphical structure and specifying set-based weights by placing intervals on the edges 
in a separate way (i.e., assuming logical independency between different intervals). 
That is, the example graph is equivalent to the set Q of graphs obtained by choosing 
real weights within the intervals in all the possible ways. Now observe that the 
intervals for the edges (A,B) and (B,C) overlap, so that there is no dominance in 
either direction. Figure 2 shows the overall partial order on the edges for the graph 
in Figure 1. 



Figure 1: An example set-based weighted graph. The sets for the edges are specified 
separately by intervals that in two cases degenerate to real numbers. 



Figure 2: The partial order on the edges of the graph in the preceding figure. Here 
an arrow from e to e' means that e dominates e'. 



Now we consider the set O of all the total orders that extend the partial order 
induced by Definition 1. Of course, O includes Or- They coincide if for each total 
order in O, there is a graph G^^Q in which w(e)>w(e') if e dominates e' in the given 
total order. This is the case, for example, when mutual information is separately 
specified via intervals on the edges. 




(A,D) 




(B,C) 
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3.1 Exact detection of strong edges 

We call strong edges the edges of G that are common to all the trees in T. Identifying 
the strong edges allows us to robustly infer dependencies that belong to the unknown 
optimal approximating trees. The following theorem is the central tool for the 
identification. 

Theorem 2. Assume O = Or- An edge e of G is strong if and only if in each 
simpl^ cycle that contains e there is an edge e' dominated by e. 

Proof. □ 

(■<=) By contradiction, assume that there is a graph G^eQ for which an optimal 
tree T docs not contain e. By adding e to T we create a cycle [PS82, Proposition 2]. 
By hypothesis, in such a cycle there must exist an edge e' dominated by e, so 
w{e) >w{e'). Removing e', we obtain a new tree that improves upon T, so that T 
cannot be optimal for G^. 

(^) By contradiction, assume that there is a cycle C in G where e does not 
dominate any edge. Then there is a total order in O in which e is dominated by 
any other edge e' in C. Since O — Or, there must also exist a related graph Gy, 
for which w{e) <w{e') for any edge e' in C. Call T the related tree. By removing 
e from T wc create two subtrees, say T' and T". One of these can possibly be a 
degenerate tree composed by a single node. Now consider that there must be an 
edge ec of G that connects a node of T' with one of T". If there was not, there 
would be no way to start from an endpoint of e in T' and reach the other endpoint, 
because all the paths would be confined within T'. The graph composed by T', T" 
and 6(7 has m — 1 edges, spans all the nodes of G, and therefore it is a tree, say T* 
[PS82, Proposition 2]. If w{e) <w{ec), T* improves upon T, so that T cannot be 
optimal for Gyj. If w(e) =w{ec)-i both T* and T are optimal, but their intersection 
does not contain e, so e^T. 

Theorem 2 directly leads to a procedure that determines whether or not a given 
edge e is strong. It suffices to consider the graph G' obtained from G by removing e 
and the edges that e dominates (see the Procedure 'DetectStrongEdges' in Table 3). 
Edge e is strong if and only if its endpoints are not connected in G' . By applying 
this procedure to the graph in Figure 1, we conclude that only (A,B) is strong. 

Note that Theorem 2 assumes that O coincides with Or- If this failed to be 
true. Or C O would still hold, making Theorem 2 work with a set of trees larger 
than T, eventually leading to an excess of caution: the edges determined by the 
above procedure would anyway be strong, but there might be strong edges that the 
procedure would not be able to determine. 

As for computational considerations, note that testing whether or not two nodes 
are connected in a graph demands O(m^) time. Repeating the test for all the edges 
ee£', we have the computational complexity of the overall procedure, O(m^). 

^This is a cycle in which the nodes are all different. In the following we will simply refer to 
simple cycles as cycles. 
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1. Let SE^$; 

2. for each vgV 

(a) if there is a node v' eV such that {v,v')^SE and it dominates {v,v") for 
each v" &V, v"^v' then 

i. add {v,v') to SE] 

3. if there is a subtree in SE then 

(a) make it the current subtree; 

(b) consider the set of edges E' CE with one endpoint in the nodes of the 
current subtree and the other outside; 

(c) if there is an edge e' e E' that dominates all the other edges in E' then 

i. add e' to SE and to the current subtree; 

ii. go to 3b; 

(d) else 

i. if there is another subtree in SE not considered yet then 
A. go to 3a; 

ii. else output SE. 

Table 1: Approximate procedure to detect strong edges. 
3.2 Approximate detection of strong edges 

This section presents a procedure that approximately detects the strong edges, re- 
ducing the complexity to O(m^) with respect to the exact procedure given in Sec- 
tion 3.1. 

Consider the algorithm outlined in a pseudo programming language in Tabic 1. 
It takes as input a fully connected graph G=<V,E>. In the algorithm, a tree with 
a number of nodes in {2,...,m — 1} is called subtree. 

The following proposition shows that the algorithm in Table 1 returns only strong 
edges. 

Proposition 3. SE is a subset of the strong edges ofG. 

Proof. Consider the first possible insertion in Step 2(a)i. The cycles that encompass 
{v,v') must pass through the set of edges {{v,v") : v" e V,v" ^ v'^. Since (f 
dominates all the edges in the preceding set, for each cycle passing through (f 
there is an edge in the cycle that is dominated by so that {v.,v') is strong, by 

Theorem 2. 
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The algorithm can insert an edge in SE also in Step 3(c)i. Recall that each 
subtree is a connected acyclic graph. It is clear that any cycle that contains e' must 
pass through an edge e" that has one endpoint in the nodes of the subtree and the 
other outside. But e' dominates e" by Step 3c. This holds for all the cycles, so e' is 
strong by Theorem 2. □ 



The logic of the algorithm in Table 1 is to move from subtrees made of strong 
edges to adjacent nodes, in order to detect the strong edges of a graph. This policy 
does not allow all the strong edges to be determined in general. For example, the 
approximate algorithm cannot determine that the edge (A,B) in Figure 1 is strong. 

The heuristic policy implements a trade-off between computational complexity 
and the capability to fully detect the strong edges. This choice does not seem 
critical to the specific extent of discovering tree-dependency structures. In fact, the 
knowledge of the actual mutual information increases with the sample size, becoming 
a number in the limit. It is easy to check that in these conditions the exact and the 
approximate procedure produce the same set of edges. 



Computational complexity. The assumption behind the following analysis is 
that the comparison of two edges can be done in constant time. In this case, given 
a set E' of edges, there is a procedure that determines in time 0{\E'\) if there 
is an edge e' e E' that dominates all the others. The first step of the procedure 
selects an edge that is candidate to be dominant. This is made by doing pairwise 
comparisons of edges and by always discarding the non-dominant edge (or edges) 
in the comparison. After at most — 1 comparisons, we know whether there is 
a candidate or not. If there is, the second step of the procedure compares such 
candidate e' with all the other edges, deciding if e' dominates all the others. This 
requires — 1 comparisons. The two steps of the procedure take 0{\E'\) time. 

Let us now focus on the algorithm in Table 1. The loop 2 is repeated m = \V\ 
times. Each time the test 2a decides whether there is a dominant edge out of m — 1 
edges (each node is connected to all the others). By the previous result, such task 
takes 0{m) time. Then the loop requires 0(171^) time. 

Now consider the two nested loops made by the instructions 3a, 3b, 3(c)ii, 
and 3(d)iA. Each time the instruction of jump 3(c)ii is executed, a new edge has 
been added to SE. Each time 3(d)iA is executed, a new subtree is considered. Since 
SE can have m — 1 edges at most and m is also an upper bound on the number of 
different subtrees, the two loops can jointly require 2m — 1 iterations at most. Each 
such iteration executes the test 3c. By using as an upper bound on \E'\, we 
need 0{m^) time to detect whether the dominant edge exists. The overall time 
required by the loops is 0{m^). This is also the computational complexity of the 
entire procedure. 
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4 Robust comparison of edges 

So far wc have focused on the detection of strong edges, taking for granted that 
there exists a method to partially compare edges based on imprecise knowledge of 
mutual information. We provide such a method in the following sections. We will 
first present a formal introduction to the imprecise Dirichlet model in Section 4.1. 
Section 4.2 will make a first step by computing robust estimates for the entropy. 
These will be used in Section 4.3 to derive robust estimates of mutual information. 
Finally, the method to compare edges will be given in Section 4.4. 

4.1 The imprecise Dirichlet model 

Random i.i.d. processes. We consider a discrete random variable i and a related 
i.i.d. random process with samples i e {l,...,d} drawn with probability tTj. The 
chances tt form a probability distribution, i.e., tt G A := {a; G iR^ : Xi>OVi, x+ = l}, 
where we have used the abbreviation x^:=Y^'l^-^Xi. The likelihood of a specific data 
set D = {ii,...,in) with rii samples i and total sample size n = n+ = J2i^i is p{D\7r)(x 
]J-7r"\ Quantities of interest are, for instance, the entropy H{n) = — ^.TTjlogTrj, 
where log denotes the natural logarithm. The chances tTj are usually unknown and 
have to be estimated from the data. 

Second order p(oste)rior. In the Bayesian approach one models the initial un- 
certainty in TT by a (second order) prior distribution p(7r) with domain ttG A. The 
Dirichlet priors p(7r) oc Yli^i ' j where comprises prior information, represent a 
large class of priors. n[ may be interpreted as (possibly fractional) "virtual" sample 
numbers. High prior belief in i can be modelled by large n\. It is convenient to write 
n\ — s-ti with s:—n'_^_, hence tG A. Examples for s are for Haldane's prior [Hal48], 
1 for Perks' prior [Per47], | for Jeffreys' prior [Jef46], and d for Bayes-Laplace's 
uniform prior [GCSR95] (all with tj = i). These are also called noninformative 
priors. From the prior and the data likelihood one can determine the posterior 
p{7r\D) =p(7r|n) oc Y[i'^i'~^^^'~'^ ■ The expected value or mean Ui := Et[ni] = ^^^^ 
is often used for estimating tTj (the accuracy may be obtained from the covariance 
of tt). The expected entropy is Et[H] = f^TC{7r)p{7r\n)d7v. An approximate so- 
lution can be obtained by exchanging E with Ti (exact only for linear functions): 
Et[H{7T)] ^7i{Et[TT]) = l-i{u). The approximation error is typically of the order 
In [WW95, HutOl, HZ05] exact expressions have been obtained: 



where '4){x) = d\o^{x) /dx is the logarithmic derivative of the Gamma function. 
There are fast implementations of ip and its derivatives and exact expressions for 
integer and half-integer arguments (see Appendix A). 




(1) 



h{u) 



u-[%lj{n + s + 1) — ■j/'((n + s)u + 1)], 
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Definition of the imprecise Dirichlet model. There are several problems with 
noninformative priors. First, the inference generally depends on the arbitrary defi- 
nition of the sample space. Second, they assume exact prior knowledge p(7r). The 
solution to the second problem is to model our ignorance by considering sets of pri- 
ors J5(7r), a model that is part of the wider theory of imprecise"^ probabilities [Wal91]. 
The specific imprecise Dirichlet model [Wal96] considers the set of all^ t G A, i.e., 
{p(7r) : t G A}, which solves also the first problem. Walley suggests to fix the hy- 
perparameter s somewhere in the interval [1,2]. A set of priors results in a set of 
posteriors, set of expected values, etc. For real-valued quantities like the expected 
entropy Et[7i] the sets are typically intervals: 

Et[H] e [min Et[n], max Et[n]] =: [H,H]. 

In the next section we derive approximations for 

H — max H(u) and H — min H(u). 
teA ^ ' — teA ^ ' 

One can show that h{u) is strictly concave (see Appendix A), i.e., h"{u)<0 and that 
h" is monotone increasing {h"'>0), which we exploit in the following. The results for 
the entropy serve as building blocks to derive similar results for the needed mutual 
information. We define the general correspondence 

... _ — where ■ ■ can be various superscripts. 

n + s 

4.2 Robust entropy estimates 

Taylor expansion of H(u). In the following we derive reliable approximations 
for H and H^. If n is not too small these approximations are close to the exact 
values. More precisely, the length of interval [ILiH] is 0{a), where cr:=^, while 
the approximations will differ from H and by at most 0{a'^). Let t* G [0,1] and 
K — "^^n+s^ ■ '^^^^ implies 

Ui — u* = a-{ti — t*) and \ui — u*\ = a\ti — t*\ < a. (2) 

Hence we may Taylor-expand H(u) around u*. H is approximately linear in u 
and hence in t. A linear function on a simplex assumes its extreme values at the 

'*In the following wc will avoid the term imprecise in favor of robust, since expressions like "exact 
imprecise intervals" sound confusing. 

^Strictly speaking, A should be the open simplex [Wal96], since p{tt) is improper for t on the 
boundary of A. For simplicity wc assume that, if necessary, considered functions of t can be, 
and are, continuously extended to the boundary of A, so that, for instance, minima and maxima 
exist. All considerations can straightforwardly, but cumbersomely, be rewritten in terms of an 
open simplex. Note that open/closed A result in open/closed robust intervals, the difference being 
numerically /practically irrelevant. 
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vertices of the simplex. The most natural point for expansion is t* = ^ in the center 
of A. For this choice the bound (2) and most of the following bounds can be 
improved to a^a\l — ^\. Other, even data-dependent choices like t* — ^ — u*, are 
possible. The only property we use in the following is that^ argmaxjii* = argmaxjrij 
and argminjU* = argminjni. We have 




i i 

For suitable Ui between u* and Ui this expansion is exact {Hr is the exact remainder). 
Approximation of H. Inserting (2) into Hi we get 

Hi = J2h'(u*)(u,-u*) = aJ2h\u*)(U-t*). 

i i 

Ignoring the 0((J^) remainder Hr, in order to maximize H{u) we only have to maxi- 
mize ^Ji' {u*)ti (the only t-dependent part). A linear function on A is maximized by 
setting the U component with largest coefficient to 1. Due to concavity of /i, h'{u*) 
is largest for the smallest u*, i.e., for smallest n,, i.e., for i = i :=argminjni. Hence 
Hi = Hiiu), where ti := S^j and u follows from t by the general correspondence. 
Hq+Hi is an O(cr^) approximation of H. Consider now the remainder Hr: 

Hr = la2^/i"(u,)|i,-i:r < H^' 

i 

due to /i"<0. This bound cannot be improved in general, since Hr = is attained 
for ti = t*. Non-positivity of Hr shows that H^ + Hi is an upper bound of H. Since 
H > H{u) for all u, H{u) in particular is a lower bound on H, and moreover also 
an 0{a'^) approximation. Together we have 

H{u) < H < Hq + Hi. 

For robust estimates, the upper bound is, of course, more interesting. 

Approximation of I^. The determination of Hi follows the same scheme as for 
Hi. We get Hi — Hi{u) with tf.— 5ii and ii^argmaxinj. Using i*| <1, Ui>:^, 
h" < and that h" is monotone increasing (h'" > 0) we get the following lower bound 
on the remainder Hr: 



lb 
R- 



^argmiriini is the i for which rii is minimal. Ties can be broken arbitrarily. Kronecker's 6ij = l 

for i=j and Sij = for i^j. 
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Putting everything together wc have 

H0 + H1+ K < H < H{u) . 

For robust estimates, the lower bound is more interesting. General approximation 
techniques for other quantities of interest are developed in [Hut03]. Exact expres- 
sions for \I£,H] are also derived there. 



4.3 Robust estimates for mutual information 

Mutual information. Here we generalize the bounds for the entropy found in 
Section 4.2 to the mutual information of two random variables t and j that take 
values in {l,...,di} and {l,...,dj}, respectively. Consider an i.i.d. random process 
with samples e {l,...,di} x {l,...,dj} drawn with joint probability TTij, where ttG 
A:={a;el?*^''^ : >OVij, x++ — l}. We are interested in the mutual information 
of I and j: 

d., _dj_ 



i=i j=i 

= TTij log TTij - 7ri+ log 7ri+ - Y log TT+j 

ij i j 

T^i+ = ^.j'^ij and 7^+j = Yli^ij marginal probabilities. Again, we assume a Dirichlet 
prior over tt^^, which leads to a Dirichlet posterior p(7Zij\n) oc Jlj^-vr^'-'^''*'^ ^. The 
expected value of Tr^ is 

77T r 1 ^ij ^^ij 

Uij := EtiTTijl = 

n + s 

The marginals TZi_^ and 7r+j are also Dirichlet with expectation Ui_^ and u^j. The 
expected mutual information Et[2] can, hence, be expressed in terms of the expec- 
tations of three entropies 

I{u) := H{u^+) + H{u+j) - H{u^j) = Hi^ft + Height - H joint 

= Y + E ~ E ^(^^j) 

i j ij 

where here and in the following we index quantities with joint, left, and right to 
denote to which distribution the quantity refers. Using (1) we get Et\I\ = I{u). 

Crude bounds for I{u). Estimates for the IDM interval [miutgA-^'tl^^], 
maxtgA£'t[2^]] can be obtained by minimizing/maximizing I{u). A crude upper 
bound can be obtained as 

7 := m^7(w) = T!lSO^[Hieft + Hright- H joint] < 
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max Hieft + max Bright — min H joint = Hie ft + H right — Kjoint^ 

where upper and lower bounds to Hieft, Bright and H_joint have been derived in 
Section 4.2. Similarly iLieft'^ iLright~ ^ joint- The problem with these bounds 
is that, although good in some cases, they can become arbitrarily crude. In the 
following we derive bounds similar to the entropy case with 0((7^) accuracy. 

0((7^) bounds for I{u). We expand I{u) around u* with a constant term /q, a 
term I\ linear in a and an exact 0{a'^) remainder. 

I{u) = Io + Ii + Ir, Iq = Hoieft + Horight — Hojoint = I{u*), 



h — Hiieft + Hiright — Hijoint 



Yl 3i3 i'tij - tij) with Qij := + h'{ulj) - h'{u*j). 



^3 



Ii is maximal if 'Yliij9ijUj is maximal. This is maximal if tij—tij '-—^{ij) jijj and (ij) :— 
argmax(jj)5fjj, hence /i = /i(tt), and lo+h and I{u) being 0((T^) approximations to 
/. Replacing all max's by min's we get /o+/i and I{u) as 0{a^) approximations to 
/. To get robust bounds we need bounds on lR = HRieft+HRright-HR joint- 

Ir < iaa,x[HRieft + HRright — Hr joint] 

— ^ R left ~^ ■'^R right ^Rjoint ~ ^R joint ~- -^R - 

Ir > min [Hr i^ft + Hr right — Hr joint] 

\, Tjlb I Tjlb w"' _L IT^^^ r'^ 

^ "-Rleft^^R right ^ R joint — ^ Rleft^ ^ Rright —■ ^ R- 

Note that for Hr we can tolerate such a crude approximation, since Hr (and H^^^^) 
are small O(cr^) corrections. In summary we have 

< I < /o + ^+ It and 
/0 + /1+ II < L < I{u) - 

4.4 Comparing edges 

For two edges a and h with no common vertex, the reliable interval containing [/,/] of 
Section 4.3 can be used separately for a and h. For edges with a common vertex the 
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results of Section 4.3 may still be used, but they may no longer be reliable or good 
from a global perspective. Consider the subgraph i—j — joint probabilities tTj^^ 
of vertices i, j, k, a Dirichlet posterior Uijk^ijt^'*'''''^ , i^ijk^ Et[Trijk\ = ^^^^^^^, 
etc. The expected mutual information between node i and j is := /(w") and 
I^:=I{u^) between j and k, where ufj = Uij+ and u'jf. = u+jk- The weight of edge a is 
w°'= [min/",max J"], where min and max are w.r.t. t°:^:=tij+. Similarly, the weight 
of edge h is = [min/*,max/*], where min and max is w.r.t. t'ji.:=t^j(^. The results 
of Section 4.3 can be used to determine the intervals. Unfortunately this procedure 
neglects the constraint t";^j — t^jj^. The correct treatment is to define larger than 
as follows: 

[w" > w^] ^ [P > for all e A] ^ mm[7" - I^] > 0. 

The crude approximation min — > min/'* — max/^ gives back the above naive 
interval comparison procedure. This shows that the naive procedure is reliable, 
but the approximation may be crude. For good estimates we proceed similar as in 
Section 4.3 to get O(a^) approximations and bounds on I"" — !^. 

mm[/"-/'']-0(o-2) o(cr2) min[/''-/']+0((T2) 



P,-ll+n{u)-ll{u) + P^'' - I'^' <^\n[P-I^ < nu)-l\u) 
jiJ^) ■= argmin[/i'«^^) - h'{u*j^) - h'{ul^^) + h'{uljf^)] 

ijk 

= arg(^^.^){mm[niin(/i'(ii*^^) - + mm(/i'(ii;^.. ) - 

and tijf^ := and, for instance, choosing t*^.. = or t*j^ = ^= K^k- 

The second representation for {ijk) shows that {ijk), and hence the bounds, can be 
computed in time 0{(P') rather than 0{d^). Note that min^ and min^ determine i 
and K as a function of j, then min^ determines j , which can be used to get i = i{j) 
and k = k{j). This lower bound on min[/"— /''] is used in the next section to robustly 
compare weights. 



5 An example 

This section illustrates the apphcation of the developed methodology to an artificial 
problem. 

The graph in Figure 3 models the domain by relationships of direct dependency, 
represented by directed arcs. Each node represents a binary (yes-no) variable that 
is associated with the probability distribution of the variable itself conditional on 
the state of the parent node. The distributions are given in Table 2. 

A model made by the graph and the probability tables, as the one above, is 
called a Bayesian network [Pea88]. We used the Bayesian network to sample units 
from the joint distribution of the variables in the graph. Each unit is a vector that 
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Figure 3: A graph that models the dependencies between the random variables of 
an artificial domain. 



Variable 


P ( variable=yes par ent =yes) 


P ( var iable=yes parent =no) 


Care of environment 


0.366 


0.366 


Low consumptions 


0.959 


0.460 


Organic farming 


0.950 


0.450 


Care of animals 


0.801 


0.332 


Low pollution 


1.000 


0.208 


Sustainable growth 


0.951 


0.200 


Vegetarianism 


0.993 


0.460 


Healthy hfestyle 


0.920 


0.300 



Table 2: Conditional probabihty distributions for the variables of the example in 
Figure 3. (The distribution of 'Care of environment' is represented in this table 
though it is actually unconditional.) 



represents a joint instance of all the variables. By the generated data set we can 
test our algorithm for the discovery of strong edges, and compare it with Chow and 
Liu's algorithm. 

The 'strong edges algorithm' is summarized for clarity in Table 3. The main 
procedure is called 'DetectStrongEdges' and it implements the exact procedure from 
Section 3.1. The comparison of edges needed by 'DetectStrongEdges' is implemented 
by the subprocedure 'TestDominance'. The test 2(a)vii there exploits the bounds 
defined in Section 4.3 (we have added superscripts a and h to the terms of the 
bounds to make it clear to which edge they refer). For edges with a common node, 
the test 2(b)vi exploits the bounds given in Section 4.4. For the dominance tests we 
have used the value 1 for the IDM hyper-parameter s (see Section 4.1). We have 
also chosen t*, = -T^r, = ^ , etc. 

Figures 4 to 7 show the progression of the models discovered by the two algo- 
rithms as more instances are read. The strong edges algorithm appears to behave 
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1. Procedure DetectStrongEdges(a set-based weighted graph G) 

(a) forest:=0; 

(b) for each edge e&E 

i. consider G' obtained from G dropping e and the edges it dominates; 

ii. if the endpoints of e are not connected in G', add e to forest; 

(c) return forest. 

2. Procedure TestDominance(edge a, edge h) 

(a) if a and h do not share nodes then (i.e., the edges are i—j and i—j) 



iii. aYnmij[h'{u*^) + h\u\j)-h'{u*^)] 

iv. am^^[h'{ul^) + h'{ul^-h'{ut^] 

-^E^i%[^'(«^)+/^'Kj)-/^'(^i)]; 
V. /r-i<^^E./^"(::S)+l^'E,/^"(S); 

vi. /r:=-i^^Ei5_/^"(a); 

vii. if /^-/^+/f-/j'+/^'^-/^"''>0, return 'true'; 



i. /o":=E./^K++)+E,/^K,+)-E.,/^K+); 

ii. /o: = Ei^«i+) + EK^(w++«)-EjK^«iJ; 

iii. /£--/[:=amin,[mini(/i'(M*^^)-/i'(<.^)) + min«(/i'(M;..)-/i'(M;^J)] 



vi. if Jg-Jg+ /i"-/i^ +/°-'''-4"''>0, return 'true'; 
(c) return 'false'. 

3. Procedure h(ii) return ml){n-\-s-\-l)—ml){nu-\-su-\-l)] 

4. Procedure h'(M) letmn %l)[n+s + l) — %l){nu + su + l)—u{n+s)%lj'{nu + su + l)-, 

5. Procedure h"(M) return —2[n+s)%l)\nu+su+l) — u{n+sy'%l)''{nu+su+l)] 

Table 3: A summary view of the strong edges algorithm. Remember that <^ = -^^^ n 
is the sample size, u... = ^^"^^^1" denotes the expectation of a certain chance, u*, the 
expectation taken for a specific value t*, of hyper-parameter t...; finally, ip denotes 
the ■0-function, described in Appendix A. 



(b) 



else (i.e., the edges arc i — j — k) 



IV, 



V, 
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more reliably than Chow and Liu's algorithm. It suspends the judgment on ambigu- 
ous cases and outputs forests. These are always composed of edges of the actual 
graph. Chow and Liu's algorithm always produces complete trees, but these mis- 
represent the actual tree until 50 instances have been read. At this point Chow and 
Liu's algorithm detects the right tree. The cautious approach implemented by the 
strong edges algorithm needs other 20 instances to produce the same complete tree. 

6 Extensions 

The methodology developed so far leads naturally to other possible extensions of 
Chow and Liu's approach. We briefly report on two different types of extensions in 
the following. 

Section 6.1 discusses the question of tree-dependency structures vs. forest- 
dependency structures under several respects. The discussion focuses both on al- 
gorithms that are alternative to the strong edges one, and that aim at yielding 
trees, and on the other hand on algorithms that emphasize the inference of forest- 
dependency structures from data. 

In Section 6.2 we extend the computation of lower and upper expectations of 
mutual information to the computation of robust credible limits. These are intervals 
for mutual information obtained from the IDM that contain the actual value with 
given probability. This result is useful in order to produce dependency structures 
that provide the user with a given guarantee level. In principle the extension to 
credible limits can be applied both to the computation of strong edges and to that 
of robust trees, as defined in the next section, although the results of Sections 6.1 
and 6.2 are actually independent, in the sense that one does not need to use them 
together. 

6.1 Forests vs. Trees 

It may be useful to critically re-consider Chow and Liu's algorithm in the following 
respect. Chow and Liu's algorithm yields always a tree by construction, and hence 
this happens also when the actual (but usually unknown) dependency structure is 
a forest. This is a questionable characteristic of the algorithm, as in the mentioned 
case yielding a tree seems to be hard to justify. There are indeed approaches in 
the literature of precise probability that suppress the edges of a maximum span- 
ning tree for which the mutual information is not large enough, yielding a forest. 
This is typically implemented using a numerical threshold £, sometimes computed 
via statistical tests. Such approaches can be used immediately also within the 
imprecise-probability framework introduced in this paper; it is sufficient to suppress 
the edges for which the upper value of mutual information [i.e., max^gvKw(e)] does 
not exceed e. In contrast with the precise-probability approach, the latter should 
have the advantage to better deal with the problem to suppress edges by mistake, as 
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a consequence of the variability of the inferred values of mutual information. This 
should be especially true once forests are inferred using the credible limits for mutual 
information introduced in the next section. 

A more subtle question is how the forests inferred using the above threshold 
procedure relate to the forests that are naturally produced by the strong edges 
algorithm in its original form. Remind that the strong edges algorithm produces a 
forest rather than a tree when there is more than one optimal tree consistent with 
the available data; indeed the algorithm aims at yielding the graphical structure 
made of the intersection of all such trees. The situation may be clarified by focusing 
on a special case: consider a problem in which the true dependency structure is a 
tree in which there are edges with the same value of mutual information, say fi. In 
this case the strong edges algorithm will never produce a tree, only a forest, also in 
the limit of infinitely many data. The reason is that there will always be multiple 
optimal trees consistent with the data, just because multiple optimal trees are a 
characteristic of the problem. In particular, there would arise a forest because some 
edges with weight /i would never belong to the set of strong edges. Now suppose 
that ii> e. In this case, the previous threshold procedure would not suppress the 
edges with mutual information equal to In other words, the two procedures 
suppress edges under different conditions: the strong edges algorithm may suppress 
edges because they have equal true values of mutual information, despite those 
values may be high; the threshold procedure only suppresses edges with low value 
of mutual information. For this reason it could make sense to apply the threshold 
procedure also as a post-processing step of the strong edges algorithm. 

The discussion so far has highhghted an interesting point. By focusing on the 
intersection of all the trees consistent with the data, the strong edges algorithm 
appears to be well suited as a tool to recover the actual dependency structure un- 
derlying the data. This is because the algorithm does not aim at recovering just 
any of the equivalent structures, rather, it focuses on the common pattern to all 
of them, which is obviously part of the actual structure. In this sense, the strong 
edges algorithm might be well suited for applications concerned with the recovery 
of causal patterns. 

On the other hand, one can think of applications for which the algorithm is prob- 
ably not so well suited. For instance, in (precise-probability) problems of pattern 
classification based on Bayesian networks [FGG97], it is important to recover any 
tree (or forest) structure for which the sum of the edge weights is maximized. In this 
case, suppressing edges with large weights only because they are not strong might 
lead to low classification accuracy. In these cases, the extension of those precise ap- 
proaches to the IDM-based inferential approach should probably follow other lines 
than those described here. One possibility could be to exploit existing results in the 
hterature of robust optimization; the work of Yaman et al. [YKPOl] seems to be 
particularly worthy of consideration. Yaman et al. consider a problem of maximum 
spanning tree for a graph with weights specified by intervals (the weights are given 
no particular interpretation), which is a special case of a set-based weighted graph. 
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They define the relative robust spanning tree as follows (using our notations): let 
T be a generic tree spanning G, and denote by T* a maximum spanning tree of 
Gyj e Q. Let resp. Su, be the sum of the edge weights of resp. T, with 
respect to the weight function w. A relative robust spanning tree T* is one that 
solves the optimization problem mmTTaax^(=]y{S^ — Sw), i-c, one that minimizes the 
largest deviation S^—S^, among all the possible graphs Gw^G- In this sense the ap- 
proach adopted by Yaman et al. is in the long tradition of the popular maximin (or 
minimax) decision criterion. Prom the computational point of view, although the 
problem is NP-complete [AVH04], recent results show that relatively large instances 
of the problem can be solved efficiently [MonOX]. The trees defined by Yaman ct 
al. could probably be combined with the IDM-based inferential approach presented 
here, suitably modified for classification problems, in order to yield relative robust 
classification trees. Here, too, it could make sense to post-process the relative robust 
trees in order to suppress edges with small upper (or even lower) values of mutual 
information, yielding a forest. 

6.2 Robust credible limits for mutual information 

In this section we develop a full inferential approach for mutual information under 
the IDM. 

An ^-credible interval for the mutual information I is an interval [X,X] which 

contains X with probability at least a, i.e., j^p{X)dX >a. We define a-credible 
intervals w.r.t. distribution pt(X) as 

[Xt,Xt] = [Et[l] - AXt , Et[X\ + AXt] such that / Pt{^)dl > a, 

J-it 

where AX* :=Xt — £'t[X] {^t'-— Et\I\—'It) is the distance from the right boundary 
it (left boundary Xt) of the a-credible interval [Xt^it] to the mean Et[T] of X under 
distribution pt. We can use 

[X,X] := [mmXt,maxXt] = [j[Xt,it] 
^ t 

as a robust credible interval, since j^Pt{I)dX> J^lpt{X)dX>a for all t. An upper 
bound for X (and similarly lower bound for X) is 

X = ma^{Et[X] + KEt) < max£'t[X] max AX* = E[X] + AX. 

Good upper bounds on I = E[X] have been derived in Section 4.3. 

For not too small n, Pt(X) is close to Gaussian due to the central limit theorem. 
So we may approximate AXt^rat with r given by a = erf(r/v^), where erf is the 
error function (e.g., r = 2 for a pa 95%) and at is the variance oi pt, keeping in mind 
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that this could be a non-conservative approximation. In order to determine AX we 
only need to estimate maxt >/Vart[I] =0{^). The variation of ^Vart[X] with t is 
of order n~^^^. If we regard this as negligibly small, we may simply fix some t* e A. 
So the robust credible interval for I can be estimated as 

J < 7 + AJ < lo + h + I^' + AJ ^ Io + h + I^' + r^/VeiTt*[I]. 

Expressions for the variance of I have been derived in [HutOl, HZ05]: 

Var,[X] = ^ E u,, ( log ' - ^ f E ^^^^ + ^ (^"') " 

Higher order corrections to the variance and higher moments have also been derived, 
but are irrelevant in light of our other approximations. In Sections 4.4 and 5 we also 
needed a lower bound on — Taking credible intervals into account we need a 
robust upper a-credible hmit for X^"- -.—X^ —X"- . Similarly as for the variance one can 
derive the following expression: 

X 1; Jq — Jq -r -'i — £j_ + -'ij ~ J-R "T 
rVVart*[X^ -X«] + 0(71"^/^), 

Vart[T'' - T«] = Vart[X''] + Vart[T«] - 2Covt[2:^ X"] , 

Variances are typically of order 1/n, so for large n, credible intervals X—X—0{l/^Jn) 
are much wider than expected intervals / — / = 0(l/n). ~ 

7 Conclusions 

This paper has tackled the problem to reliably infer trees from data. We have 
provided an exact procedure that infers strong edges in time O(m^), and have shown 
that it performs well in practice on an example problem. We have also developed 
an approximate algorithm that works in time O(m^). 

Reliability follows from using the IDM, a robust inferential model that rests 
on very weak prior assumptions. Working with the IDM involves computing lower 
and upper estimates, i.e., solving global optimization problems. These can hardly 
be tackled exactly, as they are typically non-linear and non-convex. A substantial 
part of the present work has been devoted to provide systematic approximations 
to the exact intervals with a guaranteed worst case of O(cr^). This was achieved 



22 



Marco Zaffalon and Marcus Hutter, IDSIA- 11-03 



by optimizing approximating functions, obtained by Taylor-expanding the original 
objective function. We have taken care to make these approximations conservative, 
i.e., they always include the exact interval. This is the necessary step to ultimately 
obtain over-cautious rather than overconfident models. 

More broadly speaking, the same approach has been used also for another approx- 
imation, concerned with the representation level chosen for the IDM. In principle, 
one might use the IDM for the joint realization of all the m random variables. In 
this paper we have used one IDM for each bivariate (and tri-variate, in some cases) 
realization. Using separate IDMs simplifies the treatment, but it may give rise to 
global inconsistencies (in the same lines of the discussion on comparing edges with 
a common vertex, in Section 4.4). However, their effect is only to make O strictly 
include Or, thus producing an excess of caution, as discussed in Section 3.1. 

We have already reported two developments that follow naturally from the work 
described above. The first involves the computation of robust trees, which widens 
the scope of this paper to other applications. The second is in the direction of even 
greater robustness by providing robust credibile limits for mutual information, which 
provide the user with a guarantee level on the inferred dependency structures. 

Other extensions of the present work could be considered that need further re- 
search in order to be realized. Obviously, it would be worth extending the work to 
the robust inference of more general dependency structures. This could be achieved, 
for example, in a way similar to Kleiter's work [Kle99]. One could also extend our 
approach to dependency measures other than mutual information, like the statisti- 
cal coefficient 0^ [KS67, pp. 556-561]. This would require new approximations to 
be derived for the new index under the IDM, but the first part of the paper on the 
detection of strong edges could be applied as it is. 

Another important extension could be realized by considering the inference of 
dependency structures from incomplete samples. Recent research has developed 
robust approaches to incomplete samples that make very weak assumptions on the 
mechanism responsible for the missing data [Man02, RSOl, Zaf02]. This would be 
an important step towards realism and reliability in structure inference. 

A Properties of the digamma ip function 

The digamma function ijj is defined as the logarithmic derivative of the Gamma 
function. Integral representations for ip and its derivatives are 




The h function (1) and its derivatives are h^^\u) = 

m(^V(^+s + 1) - i{n+sY-^ilj^^-^\{n+s)u + l) - u{n+syilj^^\{n+s)u + l). 
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At argument Ui = ^1^^' we get for h, h' and h" 

h{ui) = {ni+sti)['ilj{n+s+l) — 'ilj{ni+sti + l)]/{n+s), 

h'{Ui) = ■0(^ + ^ + 1) - + + - + + + 

h"{ui) = -2{n+s)i^'{ni+sti + l)-{ni+sti){n+s)i^''{ni+sti + l), 
For integral arguments the following closed representations for ip, ijj', and ijj" exist: 

^(n+l) = -7 + J]-, ^'(n+l) = ^-^-, <(n+l) = -2C(3) + 2j]- 

«=1 j=l 1=1 

where 7 = 0.5772156... is Euler's constant and C(3) = 1.202569... is Riemann's zeta 
function at 3. Closed expressions for half-integer values and fast approximations for 
arbitrary arguments also exist. The following asymptotic expansion can be used if 
one is interested in 0((^^)^) approximations only (and not rigorous bounds): 

*(.+ l) = log.-+l-J^ + 0(l). 

See [AS74] for details on the t/j function and its derivatives. Prom the above expres- 
sions one may show h" < and h'" > 0. 
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a. Strong edges algorithm b. Chow and Liu's algorithm 



Figure 4: The outputs of the two algorithms after reading 20 instances. 




a. Strong edges algorithm b. Chow and Liu's algorithm 



Figure 5: The outputs of the two algorithms after reading 30 instances. 




a. Strong edges algorithm b. Chow and Liu's algorithm 



Figure 6: The outputs of the two algorithms after reading 40 instances. 




a. Strong edges algorithm b. Chow and Liu's algorithm 



Figure 7: The outputs of the two algorithms after reading 50 instances. 



